Overview

sniper.gif pubg_map.jpg
Left: A skilled sniper taking out a moving target. Right: A PUBG map.

Motivation

Battle royale games have surged in popularity in recent years. The premise is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam the island, they loot weapons and items crucial to their survival. Players can join a game solo or with a group of friends (4 players maximum). When playing solo, players are eliminated as soon as they are killed. In group play, however, killed players can be revived by their teammates.

We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.

Through our analysis, we aim to understand which playing strategies are more successful than others: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions should be of great interest to the PUBG gaming community.

Initial Questions

First, we want to investigate how well we can predict a player’s placement based on their in-game actions. What actions or statistics are most predictive of their placement? Exploring this question can then provide insight into how different playing styles compare. We would like to be able to build a model that accurately predicts a player’s game performance, but also allows us to draw inferences about whether certain playing styles are more successful.

Data

The data comes from the Kaggle competition. To download the data, join the Kaggle competition and run the shell script download_data.sh.

Note: We will need to provide a direct download link for the TA.

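# Packages used throughout this report
library(tidyverse)  # read_csv(), %>%, dplyr verbs, gather(), ggplot2
library(janitor)    # clean_names()
library(caret)      # createDataPartition()
library(corrplot)   # corrplot()

data.url <- "https://www.dropbox.com/s/319vkfevkfb6kqt/all.zip?dl=1"

# Download and unzip the data only if the archive is not already present
if (!file.exists("./data/pubg.zip")) {
  dir.create("./data", showWarnings = FALSE)
  download.file(data.url, destfile = "./data/pubg.zip", mode = "wb")
  unzip("./data/pubg.zip", exdir = "./data/pubg")
}
# Warning: Very large datasets. Read 10000 samples before scaling up.
raw_dat  <- read_csv("data/pubg/train_V2.csv", n_max = 10000)

test_dat <- read_csv("data/pubg/test_V2.csv")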

Variables

Each row in the data contains one player’s post-game stats. A description of all data fields is provided in pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 15% of the data. The outcome variable we are trying to predict is win_place_perc.

# Select single-player data only
#   Clean names
#   Remove features that are not relevant to solo games
#   Change id (the player ID) and match_id to factors
clean_dat <- raw_dat %>% 
  clean_names() %>%
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%
  mutate(id = as.factor(id), match_id = as.factor(match_id))
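
The share of solo rows quoted above can be checked with a sketch along these lines. The toy data frame below is made up for illustration and stands in for raw_dat after clean_names(); only the three solo match_type values come from the text.

```r
# Toy stand-in for the cleaned data (hypothetical values, not real PUBG rows)
toy <- data.frame(
  match_type = c("solo", "squad-fpp", "duo", "solo-fpp", "squad", "duo-fpp",
                 "normal-solo-fpp", "squad-fpp", "squad", "duo")
)

# The three match types treated as solo play in this report
solo_modes <- c("solo", "solo-fpp", "normal-solo-fpp")

# Proportion of rows that belong to a solo mode
solo_share <- mean(toy$match_type %in% solo_modes)
solo_share  # 0.3 on this toy data
```

Running the same two lines on the full training file, rather than the toy, would give the actual share of solo games.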

Training and Test Set

We are given a training set and a test set. The outcome variable for the test set will not be released until the Kaggle competition ends on January 30, 2019. Therefore, for the purposes of this project, we will use only the provided training set, from which we will create our own training and test sets.

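# Split into train and test set (80% / 20%), stratified on the outcome
# createDataPartition(list = FALSE) returns a one-column matrix; keep it as a vector
train_ind <- createDataPartition(y = clean_dat$win_place_perc, p = 0.8, list = FALSE)[, 1]
train <- clean_dat %>%
  slice(train_ind)
test <- clean_dat %>%
  slice(-train_ind)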
head(train)
# A tibble: 6 x 23
  id    match_id boosts damage_dealt headshot_kills heals kill_place
  <fct> <fct>     <int>        <dbl>          <int> <int>      <int>
1 315c… 6dc8ff8…      0       100                 0     0         45
2 311b… 2926117…      0         8.54              0     0         48
3 b780… 2c30ddf…      1       324.                1     5          5
4 9202… 07948d7…      3       254.                0    12         13
5 4714… bc2faec…      0       137.                0     0         37
6 0ba4… f7cb761…      0       194.                1     1         19
# ... with 16 more variables: kill_points <int>, kills <int>,
#   kill_streaks <int>, longest_kill <dbl>, match_duration <int>,
#   max_place <int>, num_groups <int>, rank_points <int>,
#   ride_distance <dbl>, road_kills <int>, swim_distance <dbl>,
#   vehicle_destroys <int>, walk_distance <dbl>, weapons_acquired <int>,
#   win_points <int>, win_place_perc <dbl>

Exploratory Data Analysis

Distribution of Features by Finish Percentile

We first explored the distribution of each feature by final finish percentile. Players were first classified, across all games, into six groups by finish percentile: 0-19th, 20th-39th, 40th-59th, 60th-79th, 80th-99th, and 100th (winners). Then, we plotted the density of each feature by group.

It is important to note that the density plots aggregate by percent finish in a game. Thus, it is possible for one individual to place within the 10th percentile in one game, but then finish in the 90th percentile in another. This individual would contribute to the approximated density for both the 0-19th percentile and the 80th-99th percentile.
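
As a minimal sketch of the binning described above, the snippet below maps a few made-up win_place_perc values (on the 0 to 1 scale) to the six percentile groups. The values and the names perc, bin_start, and bin_label are illustrative assumptions, not part of the data.

```r
# Hypothetical finish percentiles on the 0-1 scale
perc <- c(0, 0.19, 0.2, 0.5, 0.99, 1)

# Lower edge of each 20-point bin; only perc == 1 lands in the 100 bin
bin_start <- floor(perc * 5) * 20

# Attach readable labels to the six possible bin edges
bin_label <- factor(
  bin_start,
  levels = c(0, 20, 40, 60, 80, 100),
  labels = c("0-19", "20-39", "40-59", "60-79", "80-99", "100")
)

data.frame(perc, bin_start, bin_label)
```

Because the last bin contains only exact winners, it is expected to hold far fewer rows than the other five.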

train %>%
  # Bin win_place_perc (0 to 1) into groups labeled 0, 20, 40, 60, 80, 100
  mutate(win_place_cat = as.factor(floor(win_place_perc * 5) * 20)) %>%
  gather("feature", "value", -match_id, -match_duration, 
         -id, -win_place_perc, -win_place_cat) %>%
  ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
  facet_wrap(feature ~., scales = "free") +
  geom_density() +
  labs(title = "Distribution of Features by Finish Percentile", 
       x = "Value of Features", y = "Density", color = "Percentile") +
  scale_color_hue(labels = c("0-19", "20-39", "40-59", "60-79", "80-99", "100")) +
  theme_bw()

This plot has some very interesting features:

  • In general, we see that as the finish percentile increases, the distribution of each feature shifts rightward.
  • The boosts and heals panels suggest that players who use more boosts or healing items tend to last longer in the game. This makes intuitive sense, as boosts increase passive health regeneration and movement speed, and healing items restore health.
  • For damage_dealt, we see similar differences among players by finish percentile. Winners (i.e., the 100th percentile) have a broad distribution of damage dealt, suggesting that some solo players win without dealing much damage while others deal significantly more. It is important to note that winners must have killed at least one individual. Thus, it is expected that the damage dealt distribution for winners is shifted to the right in comparison to players with lower finish percentiles.
  • kill_place, kill_points, kills, and win_points follow bimodal distributions. This may reflect differences in play style. Players who land in populated areas are more likely to encounter other players, resulting in a higher probability of dying, or a larger number of kills if the player survives. Thus, we can partition players in the 10th percentile finish into two categories: skilled players who die early because they dropped in a populated location but, owing to their skill, still acquire a large number of kills; and less-skilled players who die early for lack of skill despite dropping in a less populated location.
  • Some features look highly skewed (e.g., longest_kill, ride_distance, swim_distance). We may want to log-transform these variables when building our model.
  • The num_groups density plots suggest that in games where we have little data, we tend to have data on the winners. Thus, there may be some imbalance in the data that we will need to adjust for so that our model doesn’t overestimate finish percentile.
  • win_points and kill_points are external characteristics (from previous games) that attempt to characterize the skill level of a player. These distributions are bimodal, which may reflect the extremes of the two play styles described above. kill_points appears to be more predictive of finish percentile than win_points, as its right-shift across finish percentile categories is more distinct. Interestingly, rank_points suggests that prior-game ranks do not have a large impact on the final placement in a game. This makes sense, since in-game variables like drop location, loot, and circle movement can affect how likely an individual is to win.
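
The log-transformation mentioned in the list above could look like the following sketch. It assumes log1p(), that is log(1 + x), rather than a plain log, because these features contain many zeros; the toy values and the names toy, skewed, and toy_log are hypothetical.

```r
# Toy stand-in for three right-skewed features (made-up values)
toy <- data.frame(
  longest_kill  = c(0, 0, 12.5, 80, 410),
  ride_distance = c(0, 0, 0, 1500, 6200),
  swim_distance = c(0, 0, 0, 0, 35)
)

# Columns flagged as skewed in the density plots
skewed <- c("longest_kill", "ride_distance", "swim_distance")

# log1p(x) = log(1 + x) keeps zeros at zero and compresses the long right tail
toy_log <- toy
toy_log[skewed] <- lapply(toy[skewed], log1p)

toy_log
```

Whether log1p, a square root, or another transformation works best here is an open choice that would need to be checked against the actual model fit.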

Additional plots we might want:

  • Duration of game versus range of kills? (Shorter games might mean more people dropped in similar locations, etc.)

# Pairwise correlations between the numeric features
corr_matrix <- test %>% select(-id, -match_id) %>% cor()
corrplot(corr_matrix, method = "circle")
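
To read a correlation matrix like the one above as a ranked list rather than a plot, a sketch such as the following may help. It uses base R only; the toy data frame and the names corr, idx, and ranked are illustrative assumptions, and the numbers are made up.

```r
# Toy stand-in for a few numeric features (hypothetical values)
toy <- data.frame(
  kills         = c(0, 1, 2, 3, 4),
  damage_dealt  = c(0, 100, 210, 290, 400),
  walk_distance = c(500, 100, 900, 300, 700)
)

corr <- cor(toy)

# Index each unordered pair once via the upper triangle of the matrix
idx <- which(upper.tri(corr), arr.ind = TRUE)

ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

# Strongest absolute correlations first
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```

On this toy data the kills and damage_dealt pair ranks first, which is the kind of near-duplicate feature pair worth watching for when fitting a model.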

Data Analysis (Modeling)

Narrative and Summary